perm filename MACHR[4,KMC]1 blob sn#006442 filedate 1972-10-17 generic text, type T, neo UTF8
00100	COLBY AND MORAVEC
00200	
00300	
00400	CONTEXT-SENSITIVE  FEATURE  RECOGNITION FOR COMPUTER UNDERSTANDING OF
00500	NATURAL LANGUAGE IN TELETYPED DIALOGUES
00600	
00700	
00800	WHY  IS  IT SO DIFFICULT FOR MACHINES TO UNDERSTAND NATURAL LANGUAGE?
00900	IT IS BECAUSE THEY DO NOT SIMULATE SUFFICIENTLY  WHAT  PEOPLE  DO  WHEN
01000	PEOPLE  PROCESS  LANGUAGE.  MANY  YEARS  OF  EXPERIENCE WITH COMPUTER
01100	SCIENCE AND LINGUISTIC APPROACHES HAVE TAUGHT US THE SCOPE AND
01200	LIMITATIONS OF SYNTACTICAL, SEMANTIC AND CONCEPTUAL PARSING [THORNE &
01300	BRATLEY] [SIMMONS] [SCHANK] [WILKS] [WOODS] [WINOGRAD]. WHILE CONVENTIONAL
01400	PARSERS  PERFORM  SATISFACTORILY  WITH  EDITED TEXT SENTENCES OR WITH
01500	EXPRESSIONS LIMITED TO A TOY WORLD, THEY ARE INADEQUATE FOR  EVERYDAY
01600	LANGUAGE  BEHAVIOR  SUCH AS TAKES PLACE BETWEEN TWO PEOPLE WHEN THEY
01700	CONVERSE. IN AN UNDERSTANDABLY RATIONALISTIC QUEST FOR  CERTAINTY  AND
01800	ATTRACTED  BY  AN ANALOGY FROM THE PROOF THEORY OF LOGICIANS IN WHICH
01900	PROVABILITY IMPLIED COMPUTABILITY, COMPUTATIONAL LINGUISTS  HOPED  TO
02000	DEVELOP  CONTEXT-FREE FORMALISMS FOR NATURAL LANGUAGE.   BUT THE HOPE
02100	HAS NOT BEEN REALIZED AND PERHAPS IN PRINCIPLE  CANNOT  BE.   (IT  IS
02200	DIFFICULT  TO  FORMALIZE  SOMETHING  WHICH CAN HARDLY BE FORMULATED).
02300	IN THEIR DIALOGUES HUMANS ARE NEVER  CONTEXT-FREE  LINGUISTICALLY  OR
02400	CONCEPTUALLY. THE MAIN PROBLEM IS HOW TO MODEL THIS CONTEXT-SENSITIVITY.
02500	
02600	LINGUISTIC  PARSERS  USE  MORPHEMIC  ANALYSIS  TO  OBTAIN WORD-ROOTS,
02700	PARTS-OF-SPEECH  ASSIGNMENTS  AND  DICTIONARIES  CONTAINING  MULTIPLE
02800	WORD-SENSES   ALONG   WITH  SEMANTIC  FEATURES  WHICH  RESTRICT  WORD
02900	COMBINATIONS. THEY ANALYZE THE INPUT WORD BY WORD, VALIANTLY
03000	DISAMBIGUATING  AT  EACH STEP IN AN ATTEMPT TO CONSTRUCT A MEANINGFUL
03100	INTERPRETATION. WHILE SOPHISTICATED COMPUTATIONALLY,  SUCH  A  PARSER
03200	BECOMES PARALYZED BY QUITE ORDINARY CONVERSATION. IN EVERYDAY
03300	DISCOURSE PEOPLE SPEAK COLLOQUIALLY AND IDIOMATICALLY, USING ALL SORTS
03400	OF PAT PHRASES (`YOU SAID IT'), SLANG (`LETS RAP') AND CLICHES (`THATS
03500	THE WAY IT GOES'). THEY ARE CRYPTIC AND ELLIPTIC. THEY LACE THEIR
03600	UTTERANCES WITH MUMBLES (`MM-AH'), FUZZ (`WELL NOW LETS SEE') AND
03700	FRAGMENTS (`REALLY'). THEY CONVEY THEIR INTENTIONS AND IDEAS IN BOTH
03800	IDIOSYNCRATIC  AND  METAPHORICAL  WAYS,  BLITHELY  VIOLATING RULES OF
03900	'CORRECT' GRAMMAR AND SYNTAX. GIVEN THESE DIFFICULTIES, HOW IS
04000	IT THAT PEOPLE CARRY ON CONVERSATIONS EASILY MOST OF THE TIME WHILE
04100	MACHINES HAVE FOUND IT EXTREMELY DIFFICULT TO CONTINUE TO MAKE
04200	CONCEPTUALLY APPROPRIATE REPLIES WHICH COMMUNICATE
04300	UNDERSTANDING? THE OPERATIONS OF CURRENT PARSERS HAVE BEEN
04400	THOUGHTFULLY REVIEWED BY WINOGRAD [ ].
04500	
04600	
04700	IT SEEMS THAT PEOPLE 'GET THE MESSAGE' WITHOUT ANALYZING EVERY SINGLE
04800	WORD  IN  THE  INPUT.     PEOPLE MAKE INDIVIDUALISTIC SELECTIONS FROM
04900	HIGHLY REDUNDANT AND  REPETITIOUS  COMMUNICATIONS.   THESE  SELECTIVE
05000	OPERATIONS  PRODUCE  A  TRANSFORMATION OF THE INPUT BY DESTROYING AND
05100	EVEN DISTORTING INFORMATION. IN SPEED READING,  FOR  EXAMPLE,  ONLY  A
05200	SMALL  PERCENTAGE OF CONTENTIVE WORDS ON EACH PAGE NEED BE LOOKED AT.
05300	THESE WORDS SOMEHOW RESONATE WITH THE READER'S RELEVANT
05400	CONCEPTUAL-INFERENTIAL STRUCTURE, WHOSE PROCESSES ENABLE HIM TO
05500	'UNDERSTAND' NOT SIMPLY THE LANGUAGE BUT ALL SORTS OF UNMENTIONED
05600	ASPECTS OF THE SITUATIONS AND EVENTS BEING REFERRED TO BY THE LANGUAGE. IN
05700	WRITTEN TEXTS 5/6 OF THE INPUT CAN BE DISTORTED OR  DELETED  AND  THE
05800	INTENDED  MESSAGE  CAN  STILL  SUCCESSFULLY  BE  EXTRACTED.    SPOKEN
05900	CONVERSATIONS IN ENGLISH ARE KNOWN TO BE AT LEAST 50% REDUNDANT. HALF
06000	THE  WORDS  CAN  BE GARBLED AND LISTENERS NONETHELESS GET THE GIST OR
06100	DRIFT OF WHAT IS BEING  SAID.  (GIVE  FURTHER  EXPERIMENTAL  EVIDENCE
06200	HERE)
06300	
06400	TO  APPROXIMATE  SUCH  HUMAN  PERFORMANCES AN APPROACH DIFFERENT FROM
06500	THAT OF THE USUAL LINGUISTIC  PARSER   IS  REQUIRED.      THIS
06600	ALTERNATE APPROACH SHOULD INCORPORATE KNOWLEDGE GAINED FROM WORK WITH
06700	PARSERS  BUT  SHOULD  UTILIZE  PRIMARILY   CONCEPTUAL   RATHER   THAN
06800	GRAMMATICAL   FEATURES.   PARSERS   REPRESENT   COMPLEX  AND  REFINED
06900	ALGORITHMS.  WHILE ON ONE HAND THEY SUBJECT A SENTENCE TO A  DETAILED
07000	AND SOMETIMES EXCESSIVE ANALYSIS, ON THE OTHER THEY ARE FINICKY AND
07100	OVERSENSITIVE.  FOR EXAMPLE, A LINGUISTIC PARSER SIMPLY  HALTS  IF  A
07200	WORD  IN  THE  INPUT  SENTENCE  IS  NOT  PRESENT  IN  ITS DICTIONARY.
07300	UNGRAMMATICAL EXPRESSIONS, SUCH AS DOUBLE PREPOSITIONS (`DO YOU WANT TO
07400	GET OUT OF FROM THE HOSPITAL?'), ARE QUITE CONFUSING TO THEM. ON INTUITIVE
07500	GROUNDS IT IS HARDLY CREDIBLE THAT PARSERS MODEL THE MECHANISMS PEOPLE USE
07600	IN PROCESSING LANGUAGE. AS CHOMSKY [ ] HAS REMARKED, `WE NOTED AT THE
07700	OUTSET THAT PERFORMANCE AND COMPETENCE MUST BE SHARPLY DISTINGUISHED
07800	IF  EITHER  IS  TO  BE STUDIED SUCCESSFULLY.  WE HAVE NOW DESCRIBED A
07900	CERTAIN MODEL OF COMPETENCE.     IT  WOULD  BE  TEMPTING,  BUT  QUITE
08000	ABSURD, TO REGARD IT AS A MODEL OF PERFORMANCE AS WELL. THUS WE MIGHT
08100	PROPOSE THAT TO PRODUCE A  SENTENCE  THE  SPEAKER  GOES  THROUGH  THE
08200	SUCCESSIVE STEPS OF CONSTRUCTING A BASE-DERIVATION, LINE BY LINE FROM
08300	THE INITIAL SYMBOL S,  THEN  INSERTING  LEXICAL  ITEMS  AND  APPLYING
08400	GRAMMATICAL  TRANSFORMATIONS TO FORM A SURFACE STRUCTURE, AND FINALLY
08500	APPLYING THE PHONOLOGICAL RULES IN THEIR GIVEN ORDER,  IN  ACCORDANCE
08600	WITH THE CYCLIC PRINCIPLE DISCUSSED ABOVE. THERE IS NOT THE SLIGHTEST
08700	JUSTIFICATION FOR ANY  SUCH  ASSUMPTION.'  IT  IS  CLEAR  FROM  THESE
08800	REMARKS  THAT  THE  TRANSFORMATIONAL APPROACH HAS BEEN CONCERNED WITH
08900	PRODUCTION RATHER THAN INTERPRETATION OF SENTENCES AND THAT IT IS NOT
09000	ORIENTED  TOWARDS  HUMAN PERFORMANCE BUT TOWARDS AN IDEALIZED GRAMMAR
09100	OF COMPETENCE.
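
THE BRITTLENESS NOTED ABOVE, IN WHICH A PARSER HALTS ON A WORD MISSING
FROM ITS DICTIONARY, CAN BE ILLUSTRATED WITH A MINIMAL SKETCH IN PYTHON.
THE TOY DICTIONARY, THE CATEGORY NAMES AND THE FUNCTIONS `strict_lookup'
AND `tolerant_lookup' ARE HYPOTHETICAL AND CHOSEN ONLY FOR ILLUSTRATION.
THEY ARE NOT TAKEN FROM ANY ACTUAL PARSER.

```python
import re

# Hypothetical toy dictionary; words and categories are illustrative only.
DICTIONARY = {
    "DO": "AUX", "YOU": "PRONOUN", "WANT": "VERB", "TO": "PARTICLE",
    "GET": "VERB", "OUT": "PREPOSITION", "OF": "PREPOSITION",
    "FROM": "PREPOSITION", "THE": "DETERMINER", "HOSPITAL": "NOUN",
}


class UnknownWord(Exception):
    """Raised by the strict lookup when a word is not in the dictionary."""


def tokenize(sentence):
    # Upper-case the input and keep runs of letters and apostrophes.
    return re.findall(r"[A-Z']+", sentence.upper())


def strict_lookup(sentence):
    """Tag every word; give up on the first word that is not known."""
    tagged = []
    for word in tokenize(sentence):
        if word not in DICTIONARY:
            raise UnknownWord(word)
        tagged.append((word, DICTIONARY[word]))
    return tagged


def tolerant_lookup(sentence):
    """Tag the words that are known and silently skip the rest."""
    return [(w, DICTIONARY[w]) for w in tokenize(sentence) if w in DICTIONARY]
```

UNDER THESE ASSUMPTIONS `strict_lookup' RAISES AN ERROR ON `DO YOU WANT
TO GET OUT OF THE CLINIC?' BECAUSE `CLINIC' IS NOT IN THE TOY DICTIONARY,
WHILE `tolerant_lookup' RETURNS THE EIGHT WORDS IT DOES RECOGNIZE.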
09200	
09300	EARLY  ATTEMPTS  TO  DEVELOP  A  FEATURE-RECOGNITION  APPROACH  USING
09400	SPECIAL-PURPOSE HEURISTICS ARE DESCRIBED IN [ ],[ ].  THE LIMITATIONS
09500	OF  THESE  ATTEMPTS  ARE  WELL  KNOWN  TO   WORKERS   IN   ARTIFICIAL
09600	INTELLIGENCE. SUCH PRIMITIVE CONTEXT-RESTRICTED PROGRAMS GRASP A
09700	TOPIC WELL ENOUGH BUT TOO OFTEN DO NOT UNDERSTAND WHAT IS BEING SAID
09800	ABOUT THE TOPIC. THIS SHORTCOMING IS BOTH LINGUISTIC AND CONCEPTUAL.
09900	BECAUSE THE FEATURE-RECOGNITION OF SUCH PROGRAMS IS SIMPLISTIC AND THE
10000	PROGRAMS LACK A RICH CONCEPTUAL STRUCTURE AGAINST WHICH THE PATTERN
10100	ABSTRACTED FROM THE INPUT CAN BE MATCHED FOR FURTHER INFERENCING,
10200	THE MAN-MACHINE CONVERSATIONS SOON BECOME
10300	IMPOVERISHED AND BORING. WINOGRAD'S PROGRAM, WHILE LIMITED TO A FEW
10400	OBJECTS AND RELATIONS IN A TOY ROBOTIC WORLD, REPRESENTED A GREAT
10500	IMPROVEMENT IN THE FEATURE-RECOGNITION APPROACH. HOWEVER, MANY OF HIS
10600	FEATURES, SUCH AS DETERMINERS AND NOUN GROUPS, WERE GRAMMATICALLY
10700	RATHER THAN CONCEPTUALLY ORIENTED. ANOTHER FEATURE-RECOGNITION APPROACH
10800	IS THAT OF WILKS [ ], WORKING IN THE AREA OF MACHINE TRANSLATION. HIS
10900	ALGORITHM CONSTRUCTS A PATTERN FROM ENGLISH TEXT INPUT WHICH IS
11000	MATCHED AGAINST TEMPLATES IN AN INTERLINGUAL DATA BASE FROM WHICH, IN
11100	TURN, FRENCH OUTPUT IS GENERATED WITHOUT USING A GENERATIVE GRAMMAR.
11200	
11300	IN THE COURSE OF CONSTRUCTING A COMPUTER SIMULATION  OF  PARANOIA  WE
11400	WERE FACED WITH THE PROBLEM OF DEALING WITH NATURAL LANGUAGE AS IT IS
11500	USED IN THE DOCTOR-PATIENT SITUATION OF A PSYCHIATRIC INTERVIEW. THIS
11600	DOMAIN OF DISCOURSE ADMITTEDLY CONTAINS MANY STEREOTYPES (`WHAT BROUGHT
11700	YOU TO THE HOSPITAL?') AND IS CONSTRAINED IN TOPICS (NEWTON'S LAWS ARE
11800	RARELY DISCUSSED). BUT IT IS RICH ENOUGH IN VERBAL BEHAVIOR TO CHALLENGE A
11900	LANGUAGE-UNDERSTANDING ALGORITHM, SINCE A GREAT VARIETY OF HUMAN RELATIONS
12000	ARE DISCUSSED IN THIS DOMAIN, INCLUDING THAT WHICH DEVELOPS BETWEEN
12100	THE INTERVIEW PARTICIPANTS. THE JUDGEMENT OF 'PARANOIA'  IS  MADE  BY
12200	PSYCHIATRISTS   RELYING   MAINLY   ON  THE  VERBAL  BEHAVIOR  OF  THE
12300	INTERVIEWED PATIENT.  IF A PARANOID MODEL IS  TO  EXHIBIT  PARANOID
12400	BEHAVIOR  IN  A PSYCHIATRIC INTERVIEW, IT MUST BE CAPABLE OF HANDLING
12500	DIALOGUES TYPICAL OF THE DOCTOR-PATIENT CONTEXT.    SINCE  THE  MODEL
12600	CAN COMMUNICATE ONLY THROUGH TELETYPED MESSAGES, THE VIS-A-VIS ASPECTS
12700	OF THE USUAL PSYCHIATRIC INTERVIEW ARE ABSENT. THUS THE MODEL  SHOULD
12800	BE ABLE TO DEAL WITH TYPEWRITTEN NATURAL LANGUAGE INPUT AND TO OUTPUT
12900	REPLIES WHICH  ARE  INDICATIVE  OF  AN  UNDERLYING  PARANOID  THOUGHT
13000	PROCESS.
13100	
13200	IN  A PSYCHIATRIC INTERVIEW THERE IS ALWAYS A WHO SAYING SOMETHING TO
13300	A WHOM WITH DEFINITE INTENTIONS AND EXPECTATIONS. THERE ARE TWO SITUATIONS
13400	TO BE TAKEN INTO ACCOUNT: THE ONE BEING TALKED ABOUT AND THE ONE THE
13500	PARTICIPANTS ARE IN. SOMETIMES THE LATTER BECOMES THE FORMER. AS WEIZENBAUM
13600	[ ] HAS EMPHASIZED FOR COMPUTER SCIENTISTS, DIALOGUES HAVE PURPOSES AND
13700	MACHINES MUST RECOGNIZE THIS FACT. THE DOCTOR'S PURPOSE IS TO  GATHER
13800	CERTAIN  KINDS  OF INFORMATION WHILE THE PATIENT'S PURPOSE IS TO GIVE
13900	INFORMATION AND GET HELP. THAT IS, A JOB IS TO BE DONE. OUR WORKING
14000	HYPOTHESIS IS THAT EACH PARTICIPANT IN THE DIALOGUE UNDERSTANDS THE OTHER
14100	BY MATCHING SELECTED SIGNIFICANT FEATURES IN THE INPUT AGAINST STORED
14200	CONCEPTUAL PATTERNS WHICH CONTAIN INFORMATION ABOUT THE SITUATION OR EVENT
14300	BEING DESCRIBED LINGUISTICALLY. THIS UNDERSTANDING IS COMMUNICATED
14400	RECIPROCALLY BY LINGUISTIC RESPONSES JUDGED APPROPRIATE TO THE INTENTIONS
14500	AND EXPECTATIONS OF THE PARTICIPANTS. IN THIS PAPER WE SHALL DESCRIBE ONLY
14600	THE CONTEXT-SENSITIVE FEATURE-RECOGNITION PROCESSES USED TO EXTRACT A
14700	PATTERN FROM NATURAL LANGUAGE INPUT. IN A LATER COMMUNICATION WE SHALL
14800	DESCRIBE THE INFERENTIAL PROCESSES CARRIED OUT AT THE CONCEPTUAL LEVEL
14900	ONCE THE `PARADIGMATIC' PATTERN HAS BEEN RECEIVED FROM THE
15000	FEATURE-RECOGNITION PROCESSES.
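
THE WORKING HYPOTHESIS STATED ABOVE, MATCHING SELECTED SIGNIFICANT
FEATURES AGAINST STORED CONCEPTUAL PATTERNS, MIGHT BE SKETCHED IN PYTHON
AS FOLLOWS. THIS IS AN ILLUSTRATION UNDER STATED ASSUMPTIONS AND NOT THE
FEATURE RECOGNIZER OF THE MODEL: THE LEXICON `CONCEPTS', THE TABLE
`PATTERNS', THE PATTERN LABELS AND THE FUNCTION NAMES ARE ALL
HYPOTHETICAL, AND THE SKETCH MAKES NO ATTEMPT AT CONTEXT-SENSITIVITY.

```python
import re

# Hypothetical lexicon: significant words mapped to concept names.
# Words absent from it are ignored rather than causing a failure.
CONCEPTS = {
    "WHAT": "WHAT",
    "BROUGHT": "BRING", "BRING": "BRING", "BRINGS": "BRING",
    "YOU": "YOU",
    "WANT": "WANT",
    "OUT": "LEAVE", "LEAVE": "LEAVE",
    "HOSPITAL": "HOSPITAL",
}

# Hypothetical stored patterns: ordered concept sequences, each with a label.
PATTERNS = {
    ("WHAT", "BRING", "YOU", "HOSPITAL"): "ASK-REASON-FOR-ADMISSION",
    ("YOU", "WANT", "LEAVE", "HOSPITAL"): "ASK-WISH-TO-LEAVE",
}


def extract_features(text):
    """Return the concepts of the recognized words, in input order."""
    words = re.findall(r"[A-Z']+", text.upper())
    return [CONCEPTS[w] for w in words if w in CONCEPTS]


def contains_in_order(pattern, features):
    """True if every concept of the pattern occurs in features, in order."""
    remaining = iter(features)
    return all(concept in remaining for concept in pattern)


def recognize(text):
    """Return the label of the first stored pattern found, else None."""
    features = extract_features(text)
    for pattern, label in PATTERNS.items():
        if contains_in_order(pattern, features):
            return label
    return None
```

WITH THESE TABLES, `WELL NOW, WHAT BROUGHT YOU TO THE HOSPITAL?' IS
LABELLED ASK-REASON-FOR-ADMISSION AND THE UNGRAMMATICAL `DO YOU WANT TO
GET OUT OF FROM THE HOSPITAL?' IS LABELLED ASK-WISH-TO-LEAVE. INPUT THAT
MATCHES NO STORED PATTERN YIELDS None.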
15100	
15200	
15300	(HANS WRITES DESCRIPTION OF HIS FEATURE RECOGNIZER)